UTF8 strings lowercase conversion #26

p-lambert · 2014-03-07T19:36:06Z

Currently, process_text converts the given string to lowercase in order to perform the word matching, but unfortunately Ruby does not convert the non-ASCII characters in UTF-8 strings properly.
I've noticed inconsistencies like this one:

require 'whatlanguage'
puts "ÂNCORA COR ÂMBAR".language # => spanish
puts "âncora cor âmbar".language # => portuguese

Thanks for the library!
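
For readers on a newer Ruby: before Ruby 2.4, String#downcase only mapped the ASCII letters A-Z, which is the behaviour behind the inconsistency reported above. The sketch below reproduces that ASCII-only mapping on Ruby 2.4+ by passing the :ascii option, and compares it with the default Unicode-aware mapping. The variable names are illustrative only and nothing here is taken from WhatLanguage itself.

```ruby
# Hypothetical demonstration; requires Ruby 2.4+ for the :ascii option.
text = "ÂNCORA COR ÂMBAR"

# ASCII-only mapping: only A-Z are lowercased, so "Â" is left untouched.
ascii_only = text.downcase(:ascii)

# Full Unicode mapping (the default since Ruby 2.4).
unicode_aware = text.downcase

puts ascii_only     # Âncora cor Âmbar
puts unicode_aware  # âncora cor âmbar
```

With the ASCII-only result, a word such as "Âncora" no longer matches a lowercase word-list entry like "âncora", which is consistent with the different language guesses shown above.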

peterc · 2014-03-07T08:07:49Z

This is a general problem in Ruby. Do you know of any reasonable solutions?

The problem here is that WhatLanguage in its current form depends on words, and having all combinations of casing in the word lists is impractical, so we have to normalize them somehow. Is there a better way to do this normalization?

p-lambert · 2014-03-07T14:18:19Z

I did some research on that and there is no simple solution (such as a 1-to-1 mapping covering all scenarios), because several conditions have to be taken into account and, above all, some of them are locale-dependent (see, for example, the Unicode Character Properties, Case Mappings & Names FAQ).

Thus we get stuck in a circular problem: we need to normalize the string in order to identify the language and ideally the language must be taken into account in this process of normalization.

Although this seems rather disappointing, I really believe results would be greatly improved if we at least performed those simple conversions (i.e., the case folding as specified by Unicode), even disregarding these locale dependent rules.

Of course, this case conversion goes beyond the scope of this library, so I would propose using an external one. https://github.com/lang/unicode_utils seems to do the trick and appears to be well written, using the official specifications from Unicode. We could dynamically define a to_lowercase method which would either delegate the conversion to UnicodeUtils, if it is defined, or simply fall back to String#downcase. That way the user could optionally require the aforementioned library and it would not be a dependency. Does this sound too ugly?
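
A minimal sketch of the optional-delegation idea described above, not the code that was eventually merged. The module name Lowercase is hypothetical, and the check is done at call time rather than by defining the method dynamically, which is an assumption made to keep the example short. UnicodeUtils.downcase is the gem's lowercase function and is only called when the user has already required the gem.

```ruby
# Hypothetical helper module; the name is illustrative, not the library's API.
module Lowercase
  # Lowercase +text+, preferring UnicodeUtils when the caller has loaded it
  # and falling back to the built-in String#downcase otherwise.
  def self.to_lowercase(text)
    if defined?(UnicodeUtils) && UnicodeUtils.respond_to?(:downcase)
      UnicodeUtils.downcase(text)
    else
      text.downcase
    end
  end
end

puts Lowercase.to_lowercase("ANCORA COR AMBAR")
```

Because nothing in the helper requires the gem, it stays an optional dependency: users who want Unicode-aware lowercasing require unicode_utils before calling into the library, and everyone else gets the plain String#downcase behaviour.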

peterc · 2014-03-07T17:37:51Z

I concur. The plan for the next version of WhatLanguage mitigates this somewhat as it will include using histograms of Unicode codepoint usage, but this approach may still be useful.
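
As a rough illustration of what a histogram of Unicode codepoint usage could look like, here is a minimal sketch. The thread does not describe how the next version would build or compare such histograms, so the function name codepoint_histogram and everything else below are assumptions for illustration, not WhatLanguage code.

```ruby
# Hypothetical sketch; not taken from WhatLanguage.
# Counts how often each Unicode codepoint occurs in the given text.
def codepoint_histogram(text)
  counts = Hash.new(0)
  text.each_codepoint { |cp| counts[cp] += 1 }
  counts
end

histogram = codepoint_histogram("âncora cor âmbar")

# Print the three most frequent codepoints, most frequent first.
histogram.sort_by { |_, count| -count }.first(3).each do |cp, count|
  puts format("U+%04X %s %d", cp, [cp].pack("U"), count)
end
```

A profile like this works on individual characters rather than whole words, so it does not need a word list with normalized casing.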

I think your suggestion in the last paragraph makes sense. Do you want to have a quick attempt at it or would you prefer me to look at it?

p-lambert · 2014-03-07T18:35:21Z

I'll try something! Thanks

p-lambert · 2014-03-12T13:35:44Z

@peterc, any comments on that?

peterc · 2014-03-14T05:01:47Z

I think it's a nice, gentle, mostly hands-off approach that could work for now, so thanks! I'll merge it in :-)

UTF8 strings lowercase conversion

Fixed lowercase conversion on unicode strings

57a5195

peterc added a commit that referenced this pull request Mar 14, 2014

Merge pull request #26 from p-lambert/unicode-lowercase

4b8212e

UTF8 strings lowercase conversion

peterc merged commit 4b8212e into peterc:master Mar 14, 2014
